Add WhatsApp import from decrypted backup (#136) #160
roborev: Combined Review (
b3274f3 to 600d3d8
Add `import --type whatsapp` command for importing messages from a decrypted WhatsApp msgstore.db into the msgvault unified schema.

New package internal/whatsapp/:
- Reads msgstore.db as read-only SQLite
- Maps WhatsApp schema to msgvault tables (conversations, participants, messages, attachments, reactions, reply threading)
- Batch processing (1000 msgs/batch) with checkpoint/resume
- Optional --contacts for vCard name resolution (update-only, no creation)
- Optional --media-dir for content-addressed media storage
- Imports: text, images, video, audio, voice notes, documents, stickers, GIFs, reactions, replies, group participants
- Skips: system messages, calls, location shares, contacts, polls

Security:
- Media path traversal defense (sanitize, reject absolute/.. paths, boundary check against mediaDir)
- Streaming hash + copy for media (no io.ReadAll, 100MB max)
- E.164 phone validation (reject non-numeric JIDs, 4-15 digit range)
- File permissions 0600/0750 for attachments
- Per-chat reply map scoping to bound memory

Query engine updates:
- Sender filters in DuckDB and SQLite check both message_recipients (email path) and direct sender_id (WhatsApp/chat path)
- Phone number included in sender predicates for from:+447... queries
- MatchesEmpty filters account for sender_id to avoid false positives
- MCP handler routes non-email from values to display name matching
- Parquet cache extended with sender_id, message_type, attachment_count, phone_number, title columns
- Cache schema versioning to force rebuild on column layout changes

Store additions:
- EnsureConversationWithType (parameterized conversation_type)
- EnsureParticipantByPhone (E.164 validated, with identifier row)
- UpdateParticipantDisplayNameByPhone (update-only for contacts)
- EnsureConversationParticipant, UpsertReaction, UpsertMessageRawWithFormat

Tested with 1.19M messages (13.5k conversations, April 2016-present).
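The media path traversal defense listed under Security can be illustrated with a minimal sketch. The helper name `resolveMediaPath` and its error messages are hypothetical and not taken from the PR; the sketch only shows the three checks the commit message names (sanitize, reject absolute and `..` paths, boundary check against the media directory).

```go
package main

import (
	"errors"
	"path/filepath"
	"strings"
)

// resolveMediaPath is a hypothetical helper (not the PR's code). It joins a
// relative path read from the backup onto mediaDir and rejects anything that
// would resolve outside that directory.
func resolveMediaPath(mediaDir, rel string) (string, error) {
	if rel == "" {
		return "", errors.New("empty media path")
	}
	// Sanitize: normalise separators and collapse "." and ".." segments.
	clean := filepath.Clean(filepath.FromSlash(rel))
	if filepath.IsAbs(clean) {
		return "", errors.New("absolute media path rejected")
	}
	sep := string(filepath.Separator)
	if clean == ".." || strings.HasPrefix(clean, ".."+sep) {
		return "", errors.New("media path escapes media dir")
	}
	base, err := filepath.Abs(mediaDir)
	if err != nil {
		return "", err
	}
	full := filepath.Join(base, clean)
	// Boundary check: the joined path must still sit under base.
	back, err := filepath.Rel(base, full)
	if err != nil || back == ".." || strings.HasPrefix(back, ".."+sep) {
		return "", errors.New("media path outside media dir")
	}
	return full, nil
}
```

This sketch is purely lexical; it does not resolve symlinks, so a caller that needs that guarantee would have to add a check such as `filepath.EvalSymlinks` on the result.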
- Add missing p_direct_sender JOIN in SQLite MatchesEmpty(ViewSenders) path so SenderName filter doesn't reference an unjoined alias
- Add DB fallback for reply threading: when quoted message key_id is not in the per-chat in-memory map, look it up in the messages table to link replies from previous runs or other chats
- Scope GetGmailIDsByFilter to Gmail sources only (JOIN on source_type='gmail' in SQLite, message_type filter in DuckDB Parquet fallback) to prevent WhatsApp IDs leaking into deletion/staging workflows
- Remove misleading checkpoint PageToken (resume not implemented; re-runs are safe via upsert dedup)
…rmalization

- ImportContacts now counts DB errors and returns an aggregated error instead of silently dropping failures from UpdateParticipantDisplayNameByPhone
- Remove UK-specific 0→+44 normalization in normalizeVCardPhone; local numbers without country code are now skipped as ambiguous
- Only normalize unambiguous international formats (+ prefix, 00 prefix)
…s error

- MatchesEmpty(ViewSenders) now treats a sender as non-empty when either email_address or phone_number is present (both SQLite and DuckDB)
- Contact import errors now cause the import command to exit with failure instead of printing a warning and returning success
…refresh

- parseVCardFile now unfolds RFC 2425 continuation lines (leading space/tab) so multi-line FN and TEL values are correctly parsed
- Decode QUOTED-PRINTABLE encoded FN values (e.g., =C3=A9 → é)
- EnsureConversationWithType now updates conversation_type and title on existing rows when values have changed (non-empty title only)
- Added tests for folded lines, encoded names, and QP decoding
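The line unfolding and QUOTED-PRINTABLE decoding described in this commit might look roughly like the following. The function names `unfoldLines` and `decodeQP` are illustrative assumptions, not the PR's actual helpers; the decoding relies on the standard library's `mime/quotedprintable` reader.

```go
package main

import (
	"io"
	"mime/quotedprintable"
	"strings"
)

// unfoldLines joins continuation lines (those starting with a space or tab)
// onto the previous line, dropping the single leading whitespace character.
// Hypothetical helper; the PR's parseVCardFile may be structured differently.
func unfoldLines(raw string) []string {
	var out []string
	normalized := strings.ReplaceAll(raw, "\r\n", "\n")
	for _, line := range strings.Split(normalized, "\n") {
		folded := strings.HasPrefix(line, " ") || strings.HasPrefix(line, "\t")
		if len(out) > 0 && folded {
			out[len(out)-1] += line[1:]
			continue
		}
		out = append(out, line)
	}
	return out
}

// decodeQP decodes a QUOTED-PRINTABLE value such as "Ren=C3=A9".
// On a decode error it returns the input unchanged.
func decodeQP(v string) string {
	b, err := io.ReadAll(quotedprintable.NewReader(strings.NewReader(v)))
	if err != nil {
		return v
	}
	return string(b)
}
```

Unfolding before property extraction means a folded `FN` or `TEL` value is seen as one logical line by whatever parser runs next.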
vCard field names are case-insensitive per RFC 2426. Match BEGIN/END/FN/TEL using uppercased key portion while preserving original value bytes.
…erminal injection

- Handle QUOTED-PRINTABLE soft line breaks (= at EOL) during vCard parsing by joining continuation lines before property extraction
- Tighten phone normalization to only accept numbers with explicit country code indicators (+ or 00 prefix), avoiding false matches
- Add SanitizeTerminal() to strip ANSI escape sequences and control characters from untrusted metadata (chat names, snippets) before rendering to terminal/TUI
- Add tests for all three fixes
Databases created before the WhatsApp feature have no phone_number, sender_id, message_type, attachment_count, or title columns because CREATE TABLE IF NOT EXISTS is a no-op for existing tables. This breaks the MCP server and cache builder with:

    Binder Error: Column "phone_number" in REPLACE list not found

Add ALTER TABLE migrations in InitSchema() for all v2 columns. Silently ignores "duplicate column name" errors for databases that already have the columns.
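A minimal sketch of the "ALTER TABLE, ignore duplicate column" pattern described above follows. Everything here is an assumption for illustration: the `execFunc` abstraction stands in for a real database handle, and the table assignments and column types for `phone_number` and `title` are guesses (only the three `messages` column declarations appear elsewhere in this thread).

```go
package main

import (
	"fmt"
	"strings"
)

// execFunc abstracts "run one SQL statement"; a *sql.DB could be wrapped to
// satisfy it. It exists only to keep this sketch free of driver imports.
type execFunc func(query string) error

// v2Columns lists the columns to add. Table names and the TEXT types for
// phone_number and title are assumptions, not confirmed by the PR.
var v2Columns = []struct{ table, column, decl string }{
	{"participants", "phone_number", "TEXT"},
	{"messages", "sender_id", "INTEGER"},
	{"messages", "message_type", "TEXT NOT NULL DEFAULT 'email'"},
	{"messages", "attachment_count", "INTEGER DEFAULT 0"},
	{"conversations", "title", "TEXT"},
}

// isDuplicateColumn reports whether an error message is SQLite's
// "duplicate column name" failure, which means the column already exists.
func isDuplicateColumn(msg string) bool {
	return strings.Contains(strings.ToLower(msg), "duplicate column name")
}

// migrateV2 adds each column, skipping those that are already present.
func migrateV2(exec execFunc) error {
	for _, c := range v2Columns {
		q := fmt.Sprintf("ALTER TABLE %s ADD COLUMN %s %s", c.table, c.column, c.decl)
		if err := exec(q); err != nil {
			if isDuplicateColumn(err.Error()) {
				continue // already migrated
			}
			return fmt.Errorf("migrate %s.%s: %w", c.table, c.column, err)
		}
	}
	return nil
}
```

Matching on the error text is brittle but keeps the migration idempotent without a separate schema-version table; any other error still aborts the migration.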
Existing v2 caches may have been built before the schema migration added phone_number to the participants table, resulting in Parquet files without the column. Bumping to v3 ensures build-cache detects the mismatch and triggers a full rebuild automatically.
Post-rebase test failures on v0.9.0

After rebasing …

Root cause: The test schemas in …

Fix: Add to each test schema:

    attachment_count INTEGER DEFAULT 0,
    sender_id INTEGER,
    message_type TEXT NOT NULL DEFAULT 'email',

Build passes clean, and all other test packages pass. Just these 5 tests in …
From Claude:

"The test failures are a legitimate bug in this branch, not a rebase artifact. The branch adds three new …

The build cache query in cmd/msgvault/cmd/build_cache.go (lines 249-252) now references all three columns:

    COALESCE(TRY_CAST(m.attachment_count AS INTEGER), 0) as attachment_count,

But the test helper setupTestSQLite() in build_cache_test.go (line 43) still creates the messages table …

This is the contributor's responsibility to fix: the branch modified the production schema and the build …

The CI on main passes because main doesn't have these columns in the query; they were introduced by this …
Users upgrading from pre-WhatsApp msgvault have stale Parquet cache files that lack newer columns (phone_number, attachment_count, sender_id, message_type, title). The SELECT * REPLACE(...) syntax hard-fails with a binder error when a named column doesn't exist.

Probe actual Parquet schema at engine init time via DESCRIBE SELECT *. Conditionally build CTEs to REPLACE existing columns or ADD missing ones with sensible defaults. Log a warning suggesting build-cache --full-rebuild.

Also fixes all 5 build_cache_test.go schemas that were missing the new columns, unblocking the test suite.
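The "REPLACE existing columns or ADD missing ones" idea could be sketched as a small select-list builder. The type `colSpec`, the function `buildSelect`, and the fallback literals are hypothetical; the probed schema is represented here simply as a set of column names, however it was obtained.

```go
package main

import "strings"

// colSpec describes one optional column: the expression to use when the
// Parquet file has it, and the literal to substitute when it does not.
// All names here are illustrative, not the PR's.
type colSpec struct {
	name     string
	castExpr string
	fallback string
}

// buildSelect returns a DuckDB select list that REPLACEs columns present in
// the probed schema and appends defaults for those that are missing.
func buildSelect(existing map[string]bool, specs []colSpec) string {
	var replace, add []string
	for _, s := range specs {
		if existing[s.name] {
			replace = append(replace, s.castExpr+" AS "+s.name)
		} else {
			add = append(add, s.fallback+" AS "+s.name)
		}
	}
	sel := "SELECT *"
	if len(replace) > 0 {
		sel += " REPLACE (" + strings.Join(replace, ", ") + ")"
	}
	if len(add) > 0 {
		sel += ", " + strings.Join(add, ", ")
	}
	return sel
}
```

Because a column name only ever lands in one of the two lists, the generated statement never names a column in REPLACE that the file lacks, which is the condition that triggered the binder error.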
- Strip "(0)" trunk prefix before digit extraction in normalizeVCardPhone. Common in UK/European numbers: +44 (0)7700 means +447700, not +4407700.
- Use file: URI for WhatsApp SQLite DSN to safely handle paths containing '?' or other special characters.
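Taken together with the earlier normalization commits, the phone handling can be approximated as below. This is a sketch under assumptions: the function name `normalizePhone` is hypothetical, and the 4 to 15 digit bound is borrowed from the E.164 validation mentioned in the first commit rather than confirmed for `normalizeVCardPhone` itself.

```go
package main

import "strings"

// normalizePhone is a hypothetical stand-in for normalizeVCardPhone.
// It accepts only numbers with an explicit international indicator
// ("+" or "00" prefix) and returns them in "+<digits>" form.
func normalizePhone(raw string) (string, bool) {
	s := strings.TrimSpace(raw)
	// Drop the "(0)" trunk prefix first: "+44 (0)7700" means +447700.
	s = strings.ReplaceAll(s, "(0)", "")
	hasPlus := strings.HasPrefix(s, "+")
	hasDoubleZero := strings.HasPrefix(s, "00")

	var digits strings.Builder
	for _, r := range s {
		if r >= '0' && r <= '9' {
			digits.WriteRune(r)
		}
	}
	d := digits.String()

	switch {
	case hasPlus:
		// digits already start with the country code
	case hasDoubleZero:
		d = d[2:] // strip the "00" international access prefix
	default:
		return "", false // local number without country code: ambiguous
	}
	if len(d) < 4 || len(d) > 15 {
		return "", false
	}
	return "+" + d, true
}
```

For example, `normalizePhone("+44 (0)7700 900000")` yields `+447700900000`, while `normalizePhone("07700 900000")` is rejected as ambiguous.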
b7dcbcd to 1862cfa
Fixes pushed

Two commits addressing the test failures and a related runtime issue:

1. Graceful handling of missing Parquet columns (…)
2. WhatsApp bug fixes (…)

Full test suite green: …
- Sanitize error messages via textutil.SanitizeTerminal before printing in OnError, preventing terminal injection from crafted WhatsApp backups containing control sequences in JIDs or conversation IDs.
- Skip messages with empty key_id during import: they can't be uniquely identified for upsert and would collide/overwrite each other.
… everywhere

SanitizeTerminal now decodes full UTF-8 runes before checking for C1 control characters (U+0080–U+009F). Previously the check operated on raw bytes, so multi-byte C1 chars like U+009B (CSI, encoded as 0xC2 0x9B) bypassed the filter, and an attacker could inject arbitrary ANSI escape sequences via WhatsApp chat names or message content.

Added InitSchema() calls to 10 commands that were missing it: build-cache, show-message, list-senders, list-domains, list-labels, export-attachments, export-eml, repair-encoding, verify, and update-account. Without migrations, legacy databases lacking new columns (attachment_count, sender_id, message_type, phone_number, title) would fail with "no such column" errors.
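The rune-level check described above can be sketched as follows. This is not the PR's `SanitizeTerminal`; the lowercase name, the regular expression, and the decision to keep newline and tab are all assumptions made for the example.

```go
package main

import (
	"regexp"
	"strings"
)

// ansiSeq matches ESC-introduced CSI sequences and OSC sequences.
// The exact pattern is an assumption for this sketch.
var ansiSeq = regexp.MustCompile(`\x1b\[[0-?]*[ -/]*[@-~]|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)`)

// sanitizeTerminal strips escape sequences and control characters from
// untrusted text. Ranging over the string decodes UTF-8, so U+009B
// (bytes 0xC2 0x9B) is seen as a single C1 rune rather than two bytes.
func sanitizeTerminal(s string) string {
	s = ansiSeq.ReplaceAllString(s, "")
	var b strings.Builder
	for _, r := range s {
		switch {
		case r == '\n' || r == '\t':
			b.WriteRune(r) // keep ordinary whitespace (an assumption)
		case r < 0x20 || r == 0x7f:
			// drop C0 controls and DEL
		case r >= 0x80 && r <= 0x9f:
			// drop C1 controls, including the single-rune CSI U+009B
		default:
			b.WriteRune(r)
		}
	}
	return b.String()
}
```

A byte-wise filter would see 0xC2 and 0x9B separately and let both through; iterating by rune is what closes that gap.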
- Add union_by_name=true to read_parquet() in parquetCTEs() to handle schema drift across partition files (old partitions lack new columns)
- Add direct_sender CTE to SearchFast, SearchFastCount, and SearchFastWithStats materialization, matching ListMessages pattern for WhatsApp messages that use sender_id instead of message_recipients
- Update buildSearchConditions() sender/from/to/recipient filters to check phone_number and sender_id alongside email_address
- Add from_phone to all SearchFast SELECT/Scan paths including the cached temp table pagination
- Update test expectation for new from-filter EXISTS pattern
…pp import

Three bugs found during real-world import of 1.19M messages:

1. Groups with group_type=0 but @g.us server (Communities/sub-groups) were misclassified as direct chats; 617 groups affected. Extract isGroupChat() helper and use it consistently in importer.
2. ~50% of messages in newer groups have "lid" senders that resolved to NULL. Add fetchLidMap() to read jid_map table and resolveLidSender() fallback for message senders, reaction senders, and FTS indexing.
3. Newer groups have empty group_participants table. Add participant fallback that tracks resolved senders per chat and ensures each is registered as a conversation participant after the message loop.

Also updates denormalised conversation counts (message_count, participant_count, last_message_at) after the full import.
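The first two fixes can be illustrated with a rough sketch. The signatures below are guesses: the PR names `isGroupChat()` and `resolveLidSender()` but does not show them, and the shape of the lid map (lid user to phone user) is an assumption about what `fetchLidMap()` returns.

```go
package main

import "strings"

// isGroupChat treats a chat as a group when either the group_type column
// says so or the JID server is "g.us". The second condition covers the
// Community sub-groups that carry group_type=0. Signature is hypothetical.
func isGroupChat(groupType int, jidServer string) bool {
	return groupType > 0 || strings.EqualFold(jidServer, "g.us")
}

// resolveLidSender maps a "lid" sender to a phone-based user via a lookup
// table, and passes other senders through unchanged. The map layout
// (lid user -> phone user) is an assumption for this sketch.
func resolveLidSender(user, server string, lidMap map[string]string) (string, bool) {
	if server != "lid" {
		return user, true
	}
	phone, ok := lidMap[user]
	return phone, ok
}
```

Returning the `ok` flag lets the caller decide what to do with senders that are still unresolved, instead of silently storing them as NULL.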
Summary
- Add `import --type whatsapp` command for importing messages from a decrypted WhatsApp `msgstore.db` backup
- … `sender_id` path (not just email-based `message_recipients`)
- MCP handler routes non-email `from` values to display name and phone number matching
- Parquet cache extended with `sender_id`, `message_type`, `attachment_count`, `phone_number`, and `title` columns

What's in this PR
`import --type whatsapp`

    msgvault import --type whatsapp --phone "+447700900000" /path/to/msgstore.db

- Reads `msgstore.db` (SQLite) as read-only
- … `_id`)
- … `syncfull.go`)
- Optional `--media-dir` for copying media files to content-addressed storage
- Optional `--contacts` for importing vCard contacts (display name resolution only: updates existing participants, does not create new ones)

What's imported
Skipped: system messages (type 7), calls (15/64/66), location shares (9), contacts (10), polls (99), statuses/stories (11).
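The skip list above maps naturally onto a lookup table. The following is only a sketch: the type codes are the ones quoted in the line above, while the names `skippedTypes` and `shouldSkip` are hypothetical.

```go
package main

// skippedTypes maps the WhatsApp message type codes listed above to a
// human-readable reason. Names are illustrative, not the PR's.
var skippedTypes = map[int]string{
	7:  "system message",
	9:  "location share",
	10: "contact",
	11: "status/story",
	15: "call",
	64: "call",
	66: "call",
	99: "poll",
}

// shouldSkip reports whether a message type is on the skip list and why.
func shouldSkip(messageType int) (reason string, skip bool) {
	reason, skip = skippedTypes[messageType]
	return reason, skip
}
```

Returning the reason alongside the flag makes it straightforward to count skipped messages per category in an import summary.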
Security
- … `..` prefixed paths, verify resolved path stays within `--media-dir` boundary
- … `io.TeeReader` (no `io.ReadAll`), capped at 100MB max file size
- … `+` prefix in store layer
- … `keyIDToMsgID`) cleared per-chat to prevent unbounded growth across 13k+ conversations

Query engine updates
- … `buildFilterConditions` now check both `message_recipients` (email path) and direct `sender_id` (WhatsApp/chat path) for sender filters
- Phone number included in sender predicates for `from:+447...` queries
- `MatchesEmpty` filters account for `sender_id` to avoid false positive "no sender" matches on WhatsApp messages
- … `from` parameter: if value contains `@`, filters by email; if starts with `+`, filters by phone; otherwise filters by display name

Store additions
- `EnsureConversationWithType`: like `EnsureConversation` but accepts `conversationType` parameter
- `EnsureParticipantByPhone`: get-or-create by phone number with E.164 validation
- `UpdateParticipantDisplayNameByPhone`: update-only (for contacts import, does not create participants)
- `EnsureConversationParticipant`: insert-or-ignore into conversation_participants
- `UpsertReaction`: insert-or-ignore into reactions table
- `UpsertMessageRawWithFormat`: like `UpsertMessageRaw` but accepts format parameter

Design note: `import --type` vs `import-whatsapp`

The existing pattern uses `import-mbox`/`import-emlx`. This PR uses `import --type whatsapp` with a dispatcher, which is extensible for future sources (iMessage, Telegram, etc.) without adding new top-level commands. Happy to rename to `import-whatsapp` if you prefer consistency with the existing pattern.

MCP support
WhatsApp messages are fully queryable via MCP after import and cache rebuild:
Test plan
- `go test ./internal/whatsapp/...`: mapping and contacts unit tests
- `go test ./internal/query/...`: DuckDB/SQLite engine tests (updated fixtures)
- `go test ./internal/store/...`: store tests pass
- `go test ./internal/mcp/...`: MCP handler tests pass
- `go vet ./...`: clean
- `gosec ./...`: no findings in PR files (expected false positives only)
- `msgvault import --type whatsapp --phone "+44..." /path/to/msgstore.db --contacts /path/to/contacts.vcf`
- `msgvault build-cache --full-rebuild` after import
- `msgvault search`, TUI, and MCP queries

Tested with a real 1.19M message WhatsApp backup (13.5k conversations, April 2016–present). Import completes successfully, messages are searchable via TUI and MCP.
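For reference, the `from` parameter routing described under Query engine updates (contains `@` means email, leading `+` means phone, anything else is a display name) could be sketched like this. The `senderFilter` type and `routeFrom` function are hypothetical names, not the MCP handler's actual code.

```go
package main

import "strings"

// senderFilter holds exactly one populated field, depending on how the
// "from" value was classified. Type and field names are illustrative.
type senderFilter struct {
	Email       string
	Phone       string
	DisplayName string
}

// routeFrom classifies a "from" value: anything containing "@" is an
// email, a leading "+" marks a phone number, and the rest is treated as
// a display name.
func routeFrom(value string) senderFilter {
	v := strings.TrimSpace(value)
	switch {
	case strings.Contains(v, "@"):
		return senderFilter{Email: v}
	case strings.HasPrefix(v, "+"):
		return senderFilter{Phone: v}
	default:
		return senderFilter{DisplayName: v}
	}
}
```

Checking for `@` first means an address such as `+tag@example.com` is still routed to the email path rather than being mistaken for a phone number.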